Statistics for pedologists course banner image

CHAPTER 1: INTRODUCTION TO R

1.1 Introduction

R is free, open-source software and a programming language developed in 1995 at the University of Auckland. It can execute both simple and complex mathematical, statistical, and graphical functions. R is a dialect of the S language and is case sensitive. The R interface allows you to enter data and execute functions at a command prompt (>). Soil scientists use R to explore soil data, model soil properties or classes, validate raster-based model predictions and assess their uncertainty, and develop and edit packages that expand the functionality of R.

Packages are collections of code that run specific functions. They often include example data that can be used when executing those functions. Although R comes with some standard, basic statistical functions, most work will require you to install additional packages. Packages are first installed and then stored in your library, where you can load them as needed (this does not require administrative privileges).

R has been installed on all computers with NASIS and is typically updated and CCE approved once a year. USDA machines may be 1 – 3 versions behind the latest available version for public download. Having an outdated version of R rarely creates problems for the most commonly used packages, although warnings may appear.

A few tips in R before you get started…

  • R is command-line driven and requires you to type or copy and paste commands next to the greater than sign (>) that appears when you open R. When you press “enter” on your keyboard after typing a command in the R console, the command will run. If your command is not complete, R will issue a continuation prompt (signified by a plus sign, +).

  • R has a built-in editor where you can edit and select code to run. Some people find it easiest to use Notepad instead.

  • R is case sensitive, so make sure your spelling and capitalization are correct!

  • Commands in R are also called functions. The basic format of a function in R is: > function.name ( argument, options)

  • The up arrow ( ↑ ) on your keyboard can be used to bring up previous commands that you’ve typed in the R console.

  • The ‘$’ is used to indicate a particular column within the dataset (dataset$column).

  • Any text that you do not want R to act on (such as instructions, information, notes or comments) needs to be preceded by a “#”. R will ignore the remainder of the script line if # is listed in an R script. For example:

plot(x~y) # This text will not affect the plot function because of the hashtag
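
The tips above can be tried out with a short sketch. The data frame 'soil' and its values are made up for illustration and are not part of the course data:

```r
# A small made-up data frame; the values are for illustration only
soil <- data.frame(horizon = c("A", "B", "C"), clay = c(18, 32, 27))

# function.name(argument, options): mean() is the function,
# soil$clay is the argument, and na.rm = TRUE is an option
clay_mean <- mean(soil$clay, na.rm = TRUE)
clay_mean

# The $ operator pulls a single column out of the dataset
soil$clay

# R is case sensitive: 'Soil' is a different name from 'soil'
exists("soil")   # the object we just created
exists("Soil")   # a different name, found only if you created it separately
```

Everything after a # on a line is ignored by R, so the notes above can stay in the script.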

1.2 R Graphical User Interface

Navigate to R (in the start and program menus) and Open R (3.1.1, or latest version). When you first open R, the R console window below appears:

R GUI image

The R GUI contains 3 main windows: R console, R editor, and R graphics:

R GUI image

To use R, you type commands next to the command prompt ‘>’. In the R Console window, if you type a command and press ENTER, the command will run. You can also edit command text in the R Editor, then highlight the commands you wish to run, right-click, and select run. The R Graphics window appears only after a graphics function, such as plot(), is run.

Commands in R can range from simple mathematical equations to complex R statistical calculations and models. As an example of simple math, if you type: 9*8+6-1 in R and hit ENTER, R will act as a calculator and return the answer, 77.
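
A brief sketch of the calculator example, showing how operator order affects the answer. The object name 'result' is only an illustration:

```r
9 * 8 + 6 - 1        # multiplication is evaluated first: 72 + 6 - 1
9 * (8 + 6) - 1      # parentheses change the order: 9 * 14 - 1

# Store the answer in a named object instead of only printing it
result <- 9 * 8 + 6 - 1
result
```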

1.3 Data Management in R

WORKING DIRECTORY

Before working in R, create a folder to keep all R files in, such as  C:/R_data 

R GUI image

Before entering any commands in R, it is best to find out what the current working directory is by typing ‘getwd()’ next to the command prompt in the R Console.

getwd()

Now change the working directory to the new folder you set up. Use forward slashes (/) in the path, as follows, to avoid errors; R treats a single backslash (\) inside a quoted string as an escape character:

setwd("C:/R_data")

The working directory can also be changed and set by clicking on File > Change dir… from the menu bar. Setting a working directory allows you to import data into R with just a file name, not an entire folder path and file name. It is also the default folder when you save or export data from R. Every time you start an R session, you should set your working directory.
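
The following is a minimal sketch of how the working directory affects file names. It uses a temporary folder (an assumption made so the example runs on any computer) rather than C:/R_data, and the file name tiny.csv is hypothetical:

```r
old_dir <- getwd()                                # remember where we started

# A throwaway folder inside R's temporary directory (hypothetical name)
demo_dir <- file.path(tempdir(), "R_data_demo")
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                                   # forward slashes work on Windows too
getwd()

# With the working directory set, a bare file name is enough
writeLines("location,sand", "tiny.csv")
in_demo_dir <- file.exists("tiny.csv")
in_demo_dir

setwd(old_dir)                                    # return to the original folder
```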

IMPORTING DATA

The basic command for importing data into R is ‘read.table’. The command is followed by the file name and then some instructions for how to read the file.

First, create an example file: copy the contents below (starting with the line that begins with location), paste them into Notepad, and save the file as sand_example.csv in the C:/R_data folder:

location,landuse,horizon,depth,sand
city,crop,A,14,19
city,crop,B,25,21
city,pasture,A,10,23
city,pasture,B,27,34
city,range,A,15,22
city,range,B,23,23
farm,crop,A,12,31
farm,crop,B,31,35
farm,pasture,A,17,30
farm,pasture,B,26,36
farm,range,A,15,25
farm,range,B,24,29
west,crop,A,13,27
west,crop,B,29,25
west,pasture,A,11,21
west,pasture,B,31,26
west,range,A,14,23
west,range,B,24,24

The ‘sand’ object in R is created by typing:

sand <- read.table("C:/R_data/sand_example.csv", header = TRUE, sep=",")

header = TRUE – indicates that the first line contains the column headers
sep=’,’ – indicates that commas are used to delimit or separate data elements
<- and = are assignment operators in R. In the R code above, notice the “<-” after the word sand. It assigns the name “sand” to the data object being imported. If we had omitted the assignment, R would have read the file and printed its contents to the console, but the data would not have been stored, so we could not do anything further with it.
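
As a sketch of the import step, the code below writes the first few rows of the example data to a temporary file (an assumption, so that it runs without C:/R_data) and reads it back. The object names sand_a and sand_b are hypothetical. read.csv is a convenience wrapper around read.table whose defaults already suit comma-separated files with a header row:

```r
# First rows of the example data, written to a temporary file
csv_lines <- c("location,landuse,horizon,depth,sand",
               "city,crop,A,14,19",
               "city,crop,B,25,21",
               "city,pasture,A,10,23",
               "city,pasture,B,27,34")
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# Assigning the result stores the data under a name
sand_a <- read.table(tmp_file, header = TRUE, sep = ",")

# read.csv assumes header = TRUE and sep = "," by default
sand_b <- read.csv(tmp_file)

# Without an assignment, the data are only printed to the console
read.table(tmp_file, header = TRUE, sep = ",")

str(sand_a)
```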

There are other arguments besides ‘header’ and ‘sep’ that you might want to use. A quick way to find out what arguments are available for a given command is to type help(command). In this example, you would type:

help(read.table)

This command will bring up a webpage that describes all of the possible arguments for that command and usually provides examples.

EXPORTING DATA

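To export data from R, use the command ‘write.table’. Because we have already set our working directory, a file name given without a folder path is saved automatically into that folder; the full path is shown here for clarity. By default, write.table separates values with spaces and adds a column of row names, so specify sep and row.names to produce a true comma-separated file:

write.table(sand, file = "C:/R_data/sand_example2.csv", sep = ",", row.names = FALSE)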
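
A small round-trip sketch can confirm that an exported file reads back correctly. The data frame sand_small and the temporary file location are assumptions for illustration. Note that write.table writes space-separated values with row names unless told otherwise, so sep and row.names are set explicitly here:

```r
# Made-up two-row data frame for the demonstration
sand_small <- data.frame(location = c("city", "farm"), sand = c(19, 31))

out_file <- file.path(tempdir(), "sand_small.csv")

# sep = "," gives comma-separated values; row.names = FALSE drops the row-number column
write.table(sand_small, file = out_file, sep = ",", row.names = FALSE)

readLines(out_file)        # look at the raw text that was written

# Read the file back in to check the export
sand_back <- read.table(out_file, header = TRUE, sep = ",")
sand_back
```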

VIEWING DATA AND DATA OBJECTS

A few commands that you can use to view your data in R are ‘str’ and ‘names’. ‘str’ shows the structure of the data object, and ‘names’ shows the column names (headers) of your data. You can also enter the name of the table next to the command prompt to return the entire table; avoid this if your table is large.

Enter the following commands to view your dataset in R:

names(sand)
str(sand)
sand
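
For large tables, a few other base R functions give a quick look without printing everything. This is a sketch using a made-up six-row data frame called sand_demo, not the full course file:

```r
sand_demo <- data.frame(
  landuse = c("crop", "crop", "pasture", "pasture", "range", "range"),
  horizon = c("A", "B", "A", "B", "A", "B"),
  sand    = c(19, 21, 23, 34, 22, 23)
)

dim(sand_demo)              # number of rows and columns
head(sand_demo, n = 3)      # first three rows only
summary(sand_demo$sand)     # minimum, quartiles, mean, and maximum
table(sand_demo$landuse)    # count of records in each land use
```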

Data objects refer to everything you’ve created or imported and assigned a name to in R. The ‘ls’ command allows you to see what data objects are in your R session. In the figure above, you see that sand is the only data object returned. If you wanted to delete all data objects from your R session, you would type:

rm(list=ls(all=TRUE))

The ‘ls’ and ‘rm(list=ls(all=TRUE))’ functions are also available in the R GUI under the menu bar heading ‘Misc’.
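
Often you will want to remove a single object rather than everything. A minimal sketch, using a throwaway object named scratch_value (a hypothetical name):

```r
scratch_value <- 42                        # a throwaway object for this demonstration
"scratch_value" %in% ls()                  # ls() lists objects by name

rm(scratch_value)                          # remove only this object
still_there <- "scratch_value" %in% ls()   # check whether it is still listed
still_there
```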

R GUI image

MISSING DATA

In R, missing numerical and categorical values within a dataset are displayed with the symbol NA (not available). Undefined numerical results, such as 0 divided by 0, are represented by the symbol NaN (not a number); dividing a nonzero number by 0 returns Inf or -Inf instead. Some functions in R will not run if your data contain missing values. One way to test for missing values is to type:

is.na(sand)

R returns TRUE for each row and column position that holds a missing value and FALSE for each position that does not. In our sand example, there are no missing values, so R returns FALSE for every position. To quickly find out which positions are missing, type:

which(is.na(sand))

In our example with the sand dataset, “integer(0)” is returned because we do not have any missing values. When you have missing data and the function you want to run will not run with missing values, you have the following options:

  1. Exclude all rows that contain missing values (na.omit or na.exclude).
  2. Replace missing values with another value, such as zero, a global constant, or the mean or median value for that column. For example, df[is.na(df)] <- 0 recodes all NA values as 0, where df represents data with NA values.
  3. Use a data mining algorithm to predict the missing values (data mining algorithms include decision trees, clustering, regression, etc.; see Ch. 9-11).

These options will be further explored in Ch. 2, 9-11.
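
The first two options can be sketched with a small made-up data frame; df and its values are for illustration only, and the median replacement is just one possible choice:

```r
# Made-up data with one missing value in each column
df <- data.frame(depth = c(14, 25, NA, 27), sand = c(19, NA, 23, 34))

which(is.na(df), arr.ind = TRUE)   # row and column position of each NA
colSums(is.na(df))                 # number of NAs per column

mean(df$sand)                      # NA, because one value is missing
mean(df$sand, na.rm = TRUE)        # many functions can skip NAs on request

# Option 1: drop every row that contains an NA
df_complete <- na.omit(df)

# Option 2: replace NAs in one column with that column's median
df_filled <- df
df_filled$sand[is.na(df_filled$sand)] <- median(df$sand, na.rm = TRUE)

# NaN and Inf are different from NA
0 / 0    # NaN
1 / 0    # Inf
```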

1.4 Data Objects

R recognizes a dozen or so data objects (structures of data), including vectors, lists, arrays, matrices, and data frames. As soil scientists, we most often deal with data frames, like the sand file we imported into R in the exercise above. It is important to understand what data object you are using or creating and how it is handled in R.

VECTORS
Vectors are 1-dimensional ordered collections of individual elements. These elements can be numerical, character, or logical. Examples include:

R GUI image
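
A short sketch of the three kinds of vectors; the object names and values are made up for illustration:

```r
sand_pct <- c(19, 21, 23, 34)          # numerical vector
horizons <- c("A", "B", "A", "B")      # character vector
is_sandy <- sand_pct > 30              # logical vector built from a comparison

class(sand_pct)
class(horizons)
class(is_sandy)

sand_pct[2]        # elements are ordered, so they can be picked by position
length(sand_pct)

# A vector holds one type only; mixing types coerces everything to character
mixed <- c(1, "A", TRUE)
class(mixed)
```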

LISTS
Lists are ordered collections of multiple R objects. In the example below, the R objects created are x, y, and z. The list function ‘list ( )’ simply serves as a storage bin for x, y, and z. Many outputs of R functions are actually lists. In fact, data frames, like the sand dataset, are actually lists.

R GUI image
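
A minimal sketch of a list used as a storage bin. The contents of x, y, and z here are made up and may differ from those in the figure:

```r
x <- c(1, 2, 3)          # numerical vector
y <- c("A", "B")         # character vector of a different length
z <- TRUE                # single logical value

bin <- list(x = x, y = y, z = z)

bin$y                    # pull an element out by name
bin[["x"]][2]            # second value of the element named x
str(bin)                 # compact view of what the list holds

# A data frame is a list whose elements are equal-length columns
sand_df <- data.frame(horizon = c("A", "B"), sand = c(19, 21))
is.list(sand_df)
```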

ARRAYS
Arrays are multi-dimensional matrices that are limited to columns having the same data format (numeric, character, factor, etc.) and same length. The array( ) function creates arrays. The “dim” option gives the number of rows, columns, and layers, in that order.

R GUI image
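
As a sketch, the array below has 2 rows, 3 columns, and 4 layers, filled with the numbers 1 to 24 (values chosen only for illustration):

```r
arr <- array(1:24, dim = c(2, 3, 4))   # rows, columns, layers

dim(arr)
arr[2, 3, 1]     # row 2, column 3, layer 1
arr[, , 2]       # the whole second layer, returned as a 2 x 3 matrix
```

R fills arrays column by column within each layer, which is why position [2, 3, 1] holds the sixth value.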

MATRICES

Matrices are 2-dimensional arrays that are limited to columns having the same data format (numeric, character, factor, etc.) and same length. A common command for creating a matrix in R is the matrix function that requires the following inputs: matrix (vector, number of rows, number of columns).

R GUI image
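
A short sketch of the matrix function with made-up values. By default the vector fills the matrix column by column; the byrow option fills it row by row instead:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)                       # filled by column
m
m[2, 3]                                                    # row 2, column 3

m_byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # filled by row
m_byrow
m_byrow[2, 1]

dim(t(m))                                                  # t() swaps rows and columns
```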

DATA FRAMES

Data frames are table-like objects that, unlike matrices, allow different columns to have different data formats (numeric, character, factor, etc.). All columns in a data frame must have the same length.

R GUI image
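
A minimal sketch of building and inspecting a data frame; soil_df and its values are made up for illustration. The last lines show that R refuses to build a data frame from columns whose lengths do not match:

```r
soil_df <- data.frame(
  horizon = c("A", "B", "C"),        # character column
  depth   = c(14, 25, 40),           # numeric column
  sandy   = c(FALSE, FALSE, TRUE)    # logical column
)

str(soil_df)
sapply(soil_df, class)               # data format of each column

deep <- soil_df[soil_df$depth > 20, ]   # keep rows that meet a condition
deep

# Columns of length 3 and 2 cannot be combined; try() captures the error
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```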

1.5 Installing and Loading Packages

Packages are collections of well-defined and referenced code developed by R users that run specific functions. They often include example data that can be used when executing those functions. Although R comes with some standard, basic statistical functions, most of our work will require additional packages. In order to use a package, you must install and then load it. This can be done through the command line or the R GUI. Examples of both are provided below. R packages only need to be installed on your computer once unless R is upgraded or re-installed. Every time you start a new R session, you will have to load every package that you intend to use in that session.

COMMAND LINE

First, find out what packages have been installed by typing:

library()

To install a package that you do not have currently downloaded, type the following command:

install.packages("maps", repos='http://cran.case.edu/')
## Installing package into 'C:/Users/tom.davello/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'maps' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tom.davello\AppData\Local\Temp\1\RtmpKs1bQq\downloaded_packages

In this example, we are installing the maps package from a repository at Case Western Reserve University. The maps package will allow us to create nice base maps in R. If you instead typed install.packages("maps", dep = TRUE) at the R prompt, you would be prompted to select a ‘CRAN mirror’, which is a URL at a physical location that will be used to transmit data to you. The Comprehensive R Archive Network (CRAN) is a collection of sites that carry identical material for R. Choose a site that is close and reliable; for example, USA (KS) or USA (IA) are good matches for Nebraska. Select one CRAN mirror and click OK.

To use the installed package, we must load it into our current R session by typing:

library(maps)
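
Because installing and loading are separate steps, it can help to check whether a package is installed before loading it. The following is one possible sketch using base R's requireNamespace(); it does not install anything itself:

```r
pkg <- "maps"

if (requireNamespace(pkg, quietly = TRUE)) {
  # The package is installed, so load it for this session
  library(pkg, character.only = TRUE)
  message(pkg, " is loaded")
} else {
  # The package is missing; it must be installed once before it can be loaded
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}

# The stats package ships with R, so this check should always succeed
base_ok <- requireNamespace("stats", quietly = TRUE)
base_ok
```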

To find more documentation about the maps package, request more information from R:

??maps

This will send you to a webpage of search results. We are interested in the maps::map documentation. That page documents the map function. There are a lot of options, but we’ll focus on the basics.

Usage (simple form): map(database, regions)
This means the command is “map” which will be followed by specific instructions (called arguments). In this case:

database - character string naming a geographical database, or a list of x, y, and names obtained from a previous call to map. The string choices include a world map, three USA databases (usa, state, county), and more (see the package index). The location of the map databases may be overridden by setting the R_MAP_DATA_DIR environment variable. See world for further details.

regions - character vector that names the polygons to draw. Each database is composed of a collection of polygons, and each polygon has a unique name. When a region is composed of more than one polygon, the individual polygons have the name of the region, followed by a colon and a qualifier, as in michigan:north and michigan:south. Each element of regions is matched against the polygon names in the database and, according to exact, a subset is selected for drawing. The default selects all polygons in the database.

Now we can call the map function from the maps package.

map("usa")
map("state")

When the region is left out, it defaults to showing all regions. We can specify a specific region.

map("county", "west virginia")
map("county", region=c("maryland", "virginia","west virginia"))

Now try your home state.

Try some of the examples included at the end of the map {maps} documentation, from the previous search, or http://cran.r-project.org/web/packages/maps/maps.pdf

GUI

When the R Console window is active in the R GUI (simply click on the Console window if it is not currently active), you can navigate to the ‘Packages’ drop-down menu on the menu bar. You will see options to set your CRAN mirror (physical location used to transmit data to you – select the location closest to you) and load, install, and update packages.

R GUI image

You can select more than one package to install at a time by holding down the Ctrl key.

1.6 Saving R Files

In R, there are 5 types of files that you can save to keep track of your work: workspace, script, history, R Console, and graphics. It is important to save often because R, like most software, crashes periodically, especially when working with large files. Saving your work in R can be done through the command prompt or the R GUI.

WORKSPACE (.RDATA)

The R workspace consists of all the data objects you’ve created or loaded during your R session. When you quit R by either typing q() or closing the application window, R will prompt you to save your workspace. If you choose yes, R will save a file called “.RData” to your working directory. The next time you open R from that working directory, or load the .RData file saved there, all of your data objects will be available in R. You can also save or load your workspace at any time during your R session from the File menu on the menu bar.

R GUI image

R GUI image

The R command for saving your workspace is:

save.image(file="workspace.RData")
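
A related sketch: save() writes only the objects you name, whereas save.image() writes everything in the workspace, and load() brings saved objects back. The object names, the temporary file, and the use of a separate environment for the restored copies are assumptions made for illustration:

```r
# Made-up objects to save
sand_mean  <- 26.3
site_names <- c("city", "farm", "west")

ws_file <- file.path(tempdir(), "demo_workspace.RData")

# Save only the named objects to the file
save(sand_mean, site_names, file = ws_file)

# Load them into a fresh environment so existing objects are not overwritten
restored <- new.env()
loaded_names <- load(ws_file, envir = restored)

loaded_names                          # load() returns the names it restored
get("sand_mean", envir = restored)
```

Calling load(ws_file) without the envir argument restores the objects directly into your current workspace.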

R SCRIPT (.R)

An R script is simply a text file of R commands that you’ve typed. Save your scripts (whether they were written in the R Editor or an ancillary program like Notepad) so that you can reference them in the future, edit them if needed, and keep track of what you’ve done. To save R scripts in the R GUI, make sure the R Editor window is active and go to File > Save as… on the menu bar. Save scripts with the .R extension, which is the extension R expects for script files. If you are using another text editor, you won’t need to worry about saving your scripts in R. You can always copy and paste them into the R Console or R Editor from your text editor.

R GUI image

To open an R script, go to File > Open script…

R GUI image

R HISTORY (.RHISTORY)

R history is very similar to an R script except for the way it is displayed in R. When you open an R script file, the R Editor opens and lists all of the commands that you’ve saved. You can highlight and select which commands you want to run from the R Editor window and edit the R code. With an R history file, you load it into your R Console. Once loaded, you can browse the history from the command line by pressing the up-arrow and down-arrow keys. Pressing the up-arrow key displays the commands you typed one by one, beginning with the last one you ran. You can press Enter at any time to run the command that is currently displayed. You can load and save history from the menu bar: File > Load History… or File > Save History…

R GUI image

You can also use command line:

savehistory(file = "history1.Rhistory")
loadhistory(file = "history1.Rhistory")

R history files do not HAVE to have the .Rhistory extension in order for R to read them, but it is preferable. R history files can also be viewed in most text editors such as Notepad.

R CONSOLE
Another way to save your work in R is to save the R Console by clicking on File > Save to File… in the menu bar.

R GUI image

This will save everything in the R Console window to a text file, with the default extension of .txt that can be opened in any text editor. Saving the R Console is useful if you want a print out of everything that appeared in your R Console window, including code and results. You can also print your R Console work under File > Print…

GRAPHICS
Graphic outputs can be saved in one of many formats:

R GUI image

To save a graphic: (1) Click in the Graphics Device window to bring it to focus, (2) click on File > Save as … from the menu bar, and (3) save as desired image format.

R GUI image

The R command for saving a graphic is:

jpeg(file = "vector1.jpeg")
plot(vector1)
dev.off()

The first line of this command creates a blank file named “vector1” with a JPEG extension. The second line plots the data object that you want to create a graphic of (here it is conveniently the same name as the jpeg file we are creating). The third line closes the graphics device.
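
The same open, plot, close pattern applies to the other graphics devices in base R, such as png() and pdf(). The sketch below uses pdf() and a temporary file so that it runs on any computer; vector1 holds made-up values:

```r
vector1 <- c(19, 21, 23, 34, 22, 23)               # made-up values to plot

plot_file <- file.path(tempdir(), "vector1.pdf")

pdf(file = plot_file, width = 7, height = 5)       # 1. open the file as a graphics device
plot(vector1, main = "Example plot", ylab = "Sand (%)")   # 2. draw into it
dev.off()                                          # 3. close the device to finish the file

file.exists(plot_file)
```

If dev.off() is skipped, the file stays open and may be empty or unreadable.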

1.7 A Graphical user interface for R – RCMDR (R COMMANDER)

R Commander (Rcmdr) is an expanded GUI in R that allows users to run basic statistical functions in R using menu bars, icons, and information fields. It was created for students in introductory statistics courses so they could see how the software worked without learning a large number of command line scripts. Rcmdr is a great way to begin familiarizing yourself with R and statistics within a standardized framework.

To install and load Rcmdr from the R console:

install.packages("Rcmdr", dep = TRUE)
library(Rcmdr)

This should open the R Commander window. If it is not visible, select it from your toolbar (it may be in a list with other R windows).

IMPORTING DATA WITH Rcmdr

Navigate to the Data toolbar – scroll down to ‘Import data’ and then select ‘from Excel, Access or dBase dataset…’

R GUI image

Replace ‘Dataset’ with ‘sand2’ and select OK

R GUI image

Navigate to the file location using Windows Explorer and select the proper Excel file.

If we wanted to import a text file:

Navigate menus and select: Data – Import data – from text file, clipboard or url….

Complete the import box: name – change to ‘sand’, check the box ‘variable names in file’, location = local file system, field separator = commas, decimal-point character = period

R GUI image

Pay attention to the field separator (CSV files use commas), the decimal point (Americans typically use periods), and variable names in file (checking this box means that the header names will be included in the data frame).

Navigate to csv file location using Windows explorer.
Now use the R Commander GUI functions to:
View Dataset – confirm that data imported with 5 columns

R GUI image

CREATING GRAPHS

Navigate Graphs menu from the menu bar.
Select – Histogram…

R GUI image

Variable – sand
Click on the Options Tab and then Number of bins – use the default the first time, edit later
Click OK

Plot will appear in a separate window within the R GUI

Navigate back to R Commander from toolbar (may be within the R list) and select the Graphs menu again
Select – Boxplot…
Variable – sand
Then click on the Options Tab - Identify outliers – ‘automatically’

R GUI image

Click back to the Data Tab and Select ‘Plot by Groups’ button
Group variable – choose ‘landuse’
Click ‘OK’ button
Select ‘OK’ button

The plot will appear in a separate window within the R GUI

R GUI image

Boxplot output:

R GUI image
R GUI image

If you want to change something slightly, edit the command line in the script window, select the entire command and hit ‘Submit’. Notice that commands in the R Commander script window are not preceded by ‘>’ like in the R Console.

Use R help – or use an online search engine to find information about the desired function.
For instance, if you would like to change the colors of the bars in the histogram, edit the ‘col’ argument:

Hist(sand2$sand, scale="frequency", breaks="Sturges", col="blue")

Select the line and then click the ‘Submit’ button.

SIMPLE STATISTICS

To calculate basic summary statistics use the options in the Statistics menu. A couple of examples are provided below:

Navigate to Statistics menu
Select Summary - Active dataset

Results are returned in the ‘Output Window’ and consist of a summary of the number of records for each categorical (name) variable and some basic measures of the continuous (numeric) variables.

R GUI image

Navigate back to the Statistics menu and Select Means - One-way ANOVA

If you wish to reuse your model, give it a unique name in the ‘Enter name for model:’ field

Groups – landuse
Response Variable – sand
Hit OK
R GUI image

Example Output:

R GUI image

This result indicates that sand content does not differ significantly among these land uses (Pr = 0.422). Note that this simple analysis does not account for the two kinds of horizons analyzed (A and B) or the non-independent nature of multiple samples collected at each location. It also doesn’t tell you whether comparing sand content between land uses was a reasonable thing to do.
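
For reference, a comparable one-way ANOVA can be sketched at the command line with the base R function aov(). This is an assumed equivalent of the menu steps, not necessarily the exact command R Commander generates. The data are the sand_example.csv contents typed inline, and the second model is only one simple illustration of adding horizon as a factor:

```r
# The example data from sand_example.csv, entered inline
sand <- read.csv(text = "location,landuse,horizon,depth,sand
city,crop,A,14,19
city,crop,B,25,21
city,pasture,A,10,23
city,pasture,B,27,34
city,range,A,15,22
city,range,B,23,23
farm,crop,A,12,31
farm,crop,B,31,35
farm,pasture,A,17,30
farm,pasture,B,26,36
farm,range,A,15,25
farm,range,B,24,29
west,crop,A,13,27
west,crop,B,29,25
west,pasture,A,11,21
west,pasture,B,31,26
west,range,A,14,23
west,range,B,24,24")

# One-way ANOVA: does mean sand content differ among land uses?
fit <- aov(sand ~ landuse, data = sand)
summary(fit)

# Mean sand content for each land use
tapply(sand$sand, sand$landuse, mean)

# A sketch of also accounting for horizon (A vs. B)
fit_h <- aov(sand ~ landuse + horizon, data = sand)
summary(fit_h)
```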

SAVING R SCRIPTS

For record keeping, you can copy and paste script and results into Excel or other compatible software. You can also save your script for later use in R commander or R Console. To Save a script: Edit the ‘Script Window’ to reflect the analyses you want to recreate (including the data importation step).
Navigate to the File menu
Save script as…
Rename sand_rcmdr.R and save in your working directory
The same steps can be used to save the output window using ‘save output as’. Minimize the R Commander window and move to the next section.

IMPORTING AND EXECUTING A SAVED R COMMANDER SCRIPT
R COMMANDER

You can open a saved R commander script in R commander or in R editor. Sometimes, however, scripts saved with R Commander will have things encoded for R commander that aren’t apparent and won’t run directly from R editor. First we’ll open the script we just saved in R Commander, and then we’ll open it in the R editor. To use the saved script in R Commander:

Open and view R Commander
Navigate to File, Open Script file sand_rcmdr.R

It will ask you if you want to save the current log file (hit no to clear without saving). Your saved script now appears in the ‘Script Window’.

Place your cursor on any command line and hit submit.

You will see the ‘output’ and graphs display as they did when you first executed them through the menu system of R Commander.

R CONSOLE
Open the RGUI console. Navigate to the File menu and select ‘Open script.’

Use windows explorer to navigate to the previously saved script sand_rcmdr.R
A new window will open up within R - ‘R editor’.
Select the first two lines of script using your mouse to left-click and drag.
Then right click with your mouse and select ‘Run line or section’

sand2 <- read.table("C:/R_data/sand_example.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

R GUI image

This is a data input step – next time you open R, you will need to import the dataset again. If you update the file sand_example.csv, the changes will be reflected when you rerun the analysis.

Select the next line and ‘run’
library(relimp, pos=4)

This loads a package that R Commander used to run the script. Next, select the lines for any command that you wish to execute, for instance:
Hist(sand2$sand, scale="frequency", breaks="Sturges", col="blue")

This recreates the histogram graph. Enter the command:
help(hist)

to get more information about how to use this function, including its usage and the arguments you can change to modify the default output. We can edit this in R editor and save our changes for later.
Hist(sand2$sand, scale="frequency", breaks="Sturges", col="lightblue", xlab = "Total Sand")

In this example, we’ve changed the label of the x-axis and the color of the graph. At the top of the R editor window enter:
#this is a demo of how to use R editor with an R Commander Script

Now select all and run the entire script. Save your R script using the file icon or the File menu, and close R.

1.8 R Studio

R studio is an integrated development environment (IDE) that allows you to interact with R more easily; it is a separate program rather than an R package. If it is not already installed on your computer, you can download it as a tarball or zip file (this does not require admin privileges) from http://rstudio.org/download/desktop#. Follow the instructions and work with ITS to get it installed.

Navigate to R Studio from your start menu. When you open R Studio, you will see your screen split into quadrants:

  • Source – these are script files that you have saved or are creating
  • Console – this is the command prompt window for R
  • Workspace – keeps track of all data in use (which can be clicked and viewed through the source)
  • Plots – input and output space, includes files, packages and graphs that you create

R GUI image

Rstudio Support has examples in the Knowledge Base that will help you use R studio.

WORKING WITH SCRIPTS

Open the script you created with R commander and edited in R editor by going to the File menu >
Open File…
Open sand_rcmdr.R
Run commands from within the script by selecting the first command (first three lines) and hit ‘run’ from the task bar above the script.

R GUI image

Notice that the command line is passed to the console (lower left) and the data file appears in the workspace (upper right) as sand2.
Click on the file name sand2 under the Environment Tab in the top right quadrant; notice that the console shows the corresponding command ‘> View(sand2)’

R GUI image

Switch back to the sand_rcmdr.R script window, select the command that creates a histogram (begins with ‘Hist’), and hit ‘run’.
You will get the error message: Error: could not find function "Hist"
Hist is a function from one of the packages that were loaded along with Rcmdr, and those packages are not loaded now. Go to the console; at the command prompt (>), push the up arrow, edit the command to use a lowercase ‘h’, as in ‘hist’, and hit Enter.
Now edit the ‘Hist’ command in the R script in the same way; do the same for Boxplot, changing it to boxplot.

R GUI image

Select both commands (hist and boxplot) and hit ‘run’ again. Ignore the warning message. You will see the most recent graph in the Plot window; use the arrows to scroll through all graphs produced during this session. Save your script by first clicking in the R script window and then navigating to File > Save As… Name the file: sand_studio.R

HELP FEATURES OF R STUDIO

To learn more about the function you are using and the options/arguments available; take advantage of some of the help functions in R studio. Place your cursor next to the command prompt (>) in the console (lower left).
Type: > hist
Place your cursor at the end of ‘hist’ and hit Tab. You’ll see a brief explanation of the function and the name of the package it comes from (this can be handy for searching). Hit the ‘F1’ key to get further explanation (equivalent to typing help(hist) in the console). Information will appear in the Help window (lower right).

R GUI image

Look through the usage and arguments. Reenter ‘hist’ function; evaluate the effects of changing color, breaks, freq, and labels.

hist(sand2$sand, freq=TRUE, breaks=12, xlim = c(15, 40), main = "Histogram of Sand", sub = "with 12 bins", col ="lightblue", ylab = "Counts", xlab = "Total Sand")

hist(sand2$sand, freq=TRUE, breaks=5, xlim = c(15, 40), main = "Histogram of Sand", sub = "with 5 bins", col ="lightblue", ylab = "Counts", xlab = "Total Sand")

Notice how changing the ‘breaks’ argument alters the appearance of the graph. The breaks argument tells R how the individual values should be counted in bins or groups. The ‘xlim’ argument tells R where to set the upper and lower limit of the x-axis.
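
To see exactly how the values were binned, hist() can be run with plot = FALSE, which returns the break points and counts without drawing anything. This is a sketch using the 18 sand values from the example file; when breaks is a single number, R treats it as a suggestion and picks "pretty" break points, whereas a vector of break points is used exactly as given:

```r
# Sand values from sand_example.csv
sand_values <- c(19, 21, 23, 34, 22, 23, 31, 35, 30,
                 36, 25, 29, 27, 25, 21, 26, 23, 24)

# A single number is only a suggested number of bins
h_suggest <- hist(sand_values, breaks = 5, plot = FALSE)
h_suggest$breaks      # the break points R actually chose
h_suggest$counts      # number of values in each bin

# A vector of break points is used as given
h_fixed <- hist(sand_values, breaks = c(10, 15, 20, 35, 40), plot = FALSE)
h_fixed$counts
h_fixed$density       # densities adjust for the unequal bin widths
```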

Now try:

hist(sand2$sand, freq=FALSE, breaks = c(10,15,20,35,40), xlim = c(10, 40), main = "Histogram of Sand", sub = "with predefined bins", col ="lightblue", ylab = "Density", xlab = "Total Sand")

Note that this arbitrarily sets the bin breaks using a vector, c(x1,x2,x3…..). This can be a good way to separate groups, but it may alter the way you visualize the distribution. The Tab and F1 help features work for any function typed at the console command prompt. You can also search for help on any function in an installed package (even if you haven’t loaded the package):

help.search("histogram")